Fault tolerant magnetoresistive solid-state storage device

ABSTRACT

A magnetoresistive solid-state storage device (MRAM) performs error correction coding (ECC) of stored information. At manufacture or during use, each logical block of ECC encoded data and/or the corresponding set of storage cells are evaluated to determine suitability for continued use, or whether remedial action is necessary. In a first preferred method ECC decoding is attempted to determine whether information is unrecoverable from the block of ECC encoded data. In a second preferred method a parametric evaluation is made prior to attempting ECC decoding.

[0001] The present invention relates in general to a magnetoresistive solid-state storage device and to a method for controlling a magnetoresistive solid-state storage device. In particular, but not exclusively, the invention relates to a magnetoresistive solid-state storage device employing error correction coding.

[0002] A typical solid-state storage device comprises one or more arrays of storage cells for storing data. Existing semiconductor technologies provide volatile solid-state storage devices suitable for relatively short term storage of data, such as dynamic random access memory (DRAM), or devices for relatively longer term storage of data such as static random access memory (SRAM) or non-volatile flash and EEPROM devices. However, many other technologies are known or are being developed.

[0003] Recently, a magnetoresistive storage device has been developed as a new type of non-volatile solid-state storage device (see, for example, EP-A-0918334 Hewlett-Packard). The magnetoresistive solid-state storage device is also known as a magnetic random access memory (MRAM) device. MRAM devices have relatively low power consumption and relatively fast access times, particularly for data write operations, which renders MRAM devices ideally suitable for both short term and long term storage applications.

[0004] A problem arises in that MRAM devices are subject to physical failure, which can result in an unacceptable loss of stored data. Currently available manufacturing techniques for MRAM devices are subject to limitations and as a result manufacturing yields of commercially acceptable MRAM devices are relatively low. Although better manufacturing techniques are being developed, these tend to increase manufacturing complexity and cost. Hence, it is desired to apply lower cost manufacturing techniques whilst increasing device yield. Further, it is desired to increase cell density formed on a substrate such as silicon, but as the density increases manufacturing tolerances become increasingly difficult to control, again leading to higher failure rates and lower device yields. Since the MRAM devices are at a relatively early stage in development, it is desired to allow large scale manufacturing of commercially acceptable devices, whilst tolerating the limitations of current manufacturing techniques.

[0005] An aim of the present invention is to provide a magnetoresistive solid-state storage device which is tolerant of at least some failures. Another aim is to provide a method for controlling a magnetoresistive solid-state storage device to tolerate at least some failures.

[0006] A preferred aim is to provide a magnetoresistive solid-state storage device and a method for controlling such a device which is tolerant of both systematic and random failures. Other preferred aims are to provide a magnetoresistive solid-state storage device and a method for controlling such a device, which allows at least some failures to be tolerated without any loss of stored data, preferably which is efficient to implement, preferably which allows lower cost manufacturing techniques to be employed, and preferably which allows device yield to be increased.

[0007] According to a first aspect of the present invention there is provided a method for controlling a magnetoresistive solid-state storage device having a plurality of storage cells for storing a block of ECC encoded data, the method comprising the steps of: accessing a set of the plurality of storage cells; and determining whether information is unrecoverable from a block of ECC encoded data stored in the accessed storage cells.

[0008] In a first preferred embodiment, determination of whether information is unrecoverable from the stored block of ECC encoded data is made by attempting to perform ECC decoding. If the ECC decoding successfully recovers information from the block of ECC encoded data, then use of that set of storage cells can continue in future read and write access cycles. However, if the ECC decoding fails to recover information from the block of ECC encoded data, then preferably remedial action is taken concerning the set of storage cells. For example, the remedial action involves discarding that set of storage cells such that the set is not available in future read and write cycles.

[0009] Optionally, the method comprises identifying failed symbols in the block of ECC encoded data, as an output from the ECC decoding step, and comparing the identified number of failed symbols against a threshold value. The threshold value suitably represents a safety margin, such as 50% to 95% of the maximum number of failed symbols which can be corrected by ECC decoding the block of ECC encoded data. The safety margin represents the situation where, although a relatively high proportion of failed symbols have been identified in the block of ECC encoded data, it is reasonable to continue using that set of storage cells in future. Even though further systematic or random failures might be encountered in a future read operation, it is reasonable to expect that the number of failed symbols will still be correctable by ECC decoding the block of ECC encoded data.

[0010] In a second preferred embodiment of the present invention, the accessed set of storage cells is evaluated based on parametric values, prior to attempting ECC decoding of the block of ECC encoded data. Preferably, the method comprises determining whether original information is expected to be unrecoverable from the block of ECC encoded data stored in the accessed set of storage cells. In particular, it is determined whether original information is expected to be unrecoverable because the probability of failing to correctly perform ECC decoding is unacceptably high. Where original information is not expected to be unrecoverable, then use of the set of storage cells may continue. The first and second embodiments are preferably combined, such that a decision to continue use of the set of storage cells, or take remedial action, is made either after performing a parametric based test as in the second embodiment, or after performing ECC decoding as in the first embodiment, or a decision can be made at either stage.

[0011] Preferably, in the second embodiment, the method comprises determining, from accessing the set of storage cells, failed symbols in the block of ECC encoded data that have been affected by a physical failure. Suitably, a determination is made whether there are more failed symbols in the block of ECC encoded data than can be corrected by error correction decoding the block of ECC encoded data. Here, a situation is identified where, due to physical failures, ECC decoding the block of ECC encoded data may well fail to recover the original information. In other words, there is an unacceptable probability that decoding the block of ECC encoded data will not correctly recover original information.

[0012] Preferably, accessing the set of storage cells comprises obtaining parametric values, which are compared against one or more ranges. Suitably, for most of the accessed set of storage cells, a logical bit value is derived, but some of the storage cells can be identified as being affected by a physical failure. Suitably, a failure count is determined based on the identified failed cells. The failure count can simply represent the number of failed cells, but preferably the failure count is based on failed symbols of the block of ECC encoded data affected by the identified failed cells. Preferably, the failure count is compared against a threshold value. As one option, the threshold value represents the total number of failed symbols which can be corrected by ECC decoding the block of ECC encoded data. As a second option, the threshold value represents a safety margin less than the total number of failed symbols correctable by ECC decoding, such as between about 50% and 95% of the total number. In this situation the threshold value is particularly useful in that only some types of physical failures in MRAM devices can be readily identified from the obtained parametric values, and the threshold value is set such that, given the identified number of failures, it is still reasonable to perform ECC decoding, whilst allowing for an additional number of as yet unidentified failures to affect the block of ECC encoded data.
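
By way of illustration only, the symbol-based failure count and threshold comparison described above might be sketched in Python as follows. The function names, the layout in which cell i belongs to symbol i // 8, and the example correction limit of sixteen symbols are assumptions made for this sketch and are not prescribed by the method.

```python
import math
from typing import Iterable

CELLS_PER_SYMBOL = 8  # assumption: each symbol occupies 8 consecutive cells


def failed_symbol_count(failed_cells: Iterable[int],
                        cells_per_symbol: int = CELLS_PER_SYMBOL) -> int:
    """Count the distinct symbols touched by at least one failed cell.

    Cells are assumed to be numbered from 0 within the block, with cell i
    belonging to symbol i // cells_per_symbol (a hypothetical layout).
    """
    return len({cell // cells_per_symbol for cell in failed_cells})


def expected_unrecoverable(failed_cells: Iterable[int],
                           max_correctable: int,
                           margin: float = 1.0,
                           cells_per_symbol: int = CELLS_PER_SYMBOL) -> bool:
    """Return True when the failure count exceeds the threshold.

    margin=1.0 uses the full correction capability as the threshold;
    a margin such as 0.5 to 0.95 leaves room for unidentified failures.
    """
    if not 0.0 < margin <= 1.0:
        raise ValueError("margin must be in (0, 1]")
    threshold = math.floor(max_correctable * margin)
    return failed_symbol_count(failed_cells, cells_per_symbol) > threshold


# Three failed cells in symbol 0, one in symbol 1, one in symbol 8.
print(failed_symbol_count([0, 1, 2, 9, 64]))  # 3
```

Counting symbols rather than cells reflects that several failed cells within one symbol cost the decoder only a single symbol correction.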

[0013] Conveniently, original information is received for storing in the MRAM device in units of a sector, such as 512 bytes. The original information sector is error correction encoded to form one or more blocks of ECC encoded data. In the preferred embodiment a linear ECC scheme such as a Reed-Solomon code is employed. Conveniently, each sector of original information is encoded to form a sector of ECC encoded data comprising four codewords. Each codeword suitably forms the block of ECC encoded data mentioned above.

[0014] According to a second aspect of the present invention there is provided a method for controlling a magnetoresistive solid-state storage device, comprising the steps of: receiving original information which it is desired to store; error correction encoding the original information to form a block of ECC encoded data; storing the block of ECC encoded data in a set of magnetoresistive storage cells arranged in at least one array; accessing the set of storage cells; forming logical symbol values of the block of ECC encoded data from the accessed set of storage cells; error correction decoding the block of ECC encoded data to provide recovered information; if the decoding step provided recovered information then outputting the recovered information and continuing use of the set of storage cells, or else if the decoding step did not provide recovered information then taking remedial action in respect of the set of storage cells.

[0015] Preferably, the method comprises identifying, from the ECC decoding, zero or more failed symbols in the block of ECC encoded data; comparing the identified number of failed symbols against a threshold value; and, if the ECC decoding did not recover original information, or if the identified number of failed symbols is greater than the threshold value, then taking remedial action concerning the accessed set of storage cells.

[0016] According to a third aspect of the present invention there is provided a method for controlling a magnetoresistive solid-state storage device, comprising the steps of: receiving original information which it is desired to store; error correction encoding the original information to form a block of ECC encoded data; storing the block of ECC encoded data in a set of magnetoresistive storage cells arranged in at least one array; accessing the set of storage cells; comparing parametric values obtained by accessing the set of storage cells against one or more ranges; identifying failed cells amongst the accessed set of cells; forming a failure count based on the identified failed cells; comparing the failure count against a threshold value; and determining whether the original information is expected to be unrecoverable from the block of ECC encoded data stored in the accessed set of storage cells.

[0017] According to a fourth aspect of the present invention there is provided a magnetoresistive solid-state storage device, comprising: at least one array of magnetoresistive storage cells; an ECC encoding unit for forming a block of ECC encoded data from a unit of original information; and a controller arranged to store the block of ECC encoded data in a set of the storage cells, access the set of storage cells, and determine whether the original information is unrecoverable from the block of ECC encoded data stored in the accessed set of storage cells.

[0018] For a better understanding of the invention, and to show how embodiments of the same may be carried into effect, reference will now be made, by way of example, to the accompanying diagrammatic drawings in which:

[0019] FIG. 1 is a schematic diagram showing a preferred MRAM device including an array of storage cells;

[0020] FIG. 2 shows a preferred logical data structure;

[0021] FIG. 3 shows an overview of a preferred method for controlling an MRAM device;

[0022] FIG. 4 shows a first preferred method for controlling an MRAM device;

[0023] FIG. 5 shows a second preferred method for controlling an MRAM device; and

[0024] FIG. 6 is a graph illustrating a parametric value obtained from a storage cell of an MRAM device.

[0025] To assist a complete understanding of the present invention, an example MRAM device will first be described with reference to FIG. 1, including a description of the failure mechanisms found in MRAM devices. The preferred methods for controlling such MRAM devices will then be described with reference to FIGS. 2 to 6.

[0026] FIG. 1 shows a simplified magnetoresistive solid-state storage device 1 comprising an array 10 of storage cells 16. The array 10 is coupled to a controller 20 which, amongst other control elements, includes an ECC coding and decoding unit 22. The controller 20 and the array 10 can be formed on a single substrate, or can be arranged separately.

[0027] In one preferred embodiment, the array 10 comprises of the order of 1024 by 1024 storage cells, just a few of which are illustrated. The cells 16 are each formed at an intersection between control lines 12 and 14. In this example control lines 12 are arranged in rows, and control lines 14 are arranged in columns. One row 12 and one or more columns 14 are selected to access the required storage cell or cells 16 (or conversely one column and several rows, depending upon the orientation of the array). Suitably, the row and column lines are coupled to control circuits 18, which include a plurality of read/write control circuits. Depending upon the implementation, one read/write control circuit is provided per column, or read/write control circuits are multiplexed or shared between columns. In this example the control lines 12 and 14 are generally orthogonal, but other more complicated lattice structures are also possible.

[0028] In a read operation of the currently preferred MRAM device, a single row line 12 and several column lines 14 (represented by thicker lines in FIG. 1) are activated in the array 10 by the control circuits 18, and a set of data is read from those activated cells. This operation is termed a slice. The row in this example has a length l of 1024 storage cells, and the accessed storage cells 16 are separated by a minimum reading distance m, such as sixty-four cells, to minimise cross-cell interference in the read process. Hence, each slice provides up to l/m = 1024/64 = 16 bits from the accessed array.
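
As a simple arithmetic check of the slice size, the following Python sketch computes the number of cells that one row can contribute to a slice. The function name `bits_per_slice` is hypothetical and is used here only for illustration.

```python
def bits_per_slice(row_length: int, min_read_distance: int) -> int:
    """Upper bound on the cells read from one row in a single slice,
    given that accessed cells lie at least min_read_distance apart."""
    if row_length <= 0 or min_read_distance <= 0:
        raise ValueError("row_length and min_read_distance must be positive")
    return row_length // min_read_distance


# Values from the example: l = 1024 cells, m = 64 cells.
print(bits_per_slice(1024, 64))  # 16
```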

[0029] To provide an MRAM device of a desired storage capacity, preferably a plurality of independently addressable arrays 10 are arranged to form a macro-array. Conveniently, a small plurality of arrays 10 (typically four) are layered to form a stack, and plural stacks are arranged together, such as in a 16×16 layout. Preferably, each macro-array has a 16×18×4 or 16×20×4 layout (expressed as width×height×stack layers). Optionally, the MRAM device comprises more than one macro-array. In the currently preferred MRAM device only one of the four arrays in each stack can be accessed at any one time. Hence, a slice from a macro-array reads a set of cells from one row of a subset of the plurality of arrays 10, the subset preferably being one array within each stack.

[0030] Each storage cell 16 stores one bit of data suitably representing a numerical value and preferably a binary value, i.e. one or zero. Suitably, each storage cell includes two films which assume one of two stable magnetisation orientations, known as parallel and anti-parallel. The magnetisation orientation affects the resistance of the storage cell. When the storage cell 16 is in the anti-parallel state, the resistance is at its highest, and when the magnetic storage cell is in the parallel state, the resistance is at its lowest. Suitably, the anti-parallel state defines a zero logic state, and the parallel state defines a one logic state, or vice versa. As further background information, EP-A-0 918 334 (Hewlett-Packard) discloses one example of a magnetoresistive solid-state storage device which is suitable for use in preferred embodiments of the present invention.

[0031] Although MRAM devices are generally reliable, it has been found that failures can occur which affect the ability of the device to store data reliably in the storage cells 16. Physical failures within an MRAM device can result from many causes including manufacturing imperfections, internal effects such as noise in a read process, environmental effects such as temperature and surrounding electromagnetic noise, or ageing of the device in use. In general, failures can be classified as either systematic failures or random failures. Systematic failures consistently affect a particular storage cell or a particular group of storage cells. Random failures occur transiently and are not consistently repeatable. Typically, systematic failures arise as a result of manufacturing imperfections and ageing, whilst random failures occur in response to internal effects and to external environmental effects.

[0032] Failures are highly undesirable and mean that at least some storage cells in the device cannot be written to or read from reliably. A cell affected by a failure can become unreadable, in which case no logical value can be read from the cell, or can become unreliable, in which case the logical value read from the cell is not necessarily the same as the value written to the cell (e.g. a “1” is written but a “0” is read). The storage capacity and reliability of the device can be severely affected and in the worst case the entire device becomes unusable.

[0033] Failure mechanisms take many forms, and the following examples are amongst those identified:

[0034] 1. Shorted bits, where the resistance of the storage cell is much lower than expected. Shorted bits tend to affect all storage cells lying in the same row and the same column.

[0035] 2. Open bits, where the resistance of the storage cell is much higher than expected. Open bit failures can, but do not always, affect all storage cells lying in the same row or column, or both.

[0036] 3. Half-select bits, where writing to a storage cell in a particular row or column causes another storage cell in the same row or column to change state. A cell which is vulnerable to half select will therefore possibly change state in response to a write access to any storage cell in the same row or column, resulting in unreliable stored data.

[0037] 4. Single failed bits, where a particular storage cell fails (e.g. is stuck always as a “0”), but does not affect other storage cells and is not affected by activity in other storage cells.

[0038] These four example failure mechanisms are each systematic, in that the same storage cell or cells are consistently affected. Where the failure mechanism affects only one cell, this can be termed an isolated failure. Where the failure mechanism affects a group of cells, this can be termed a grouped failure.

[0039] Whilst the storage cells of the MRAM device can be used to store data according to any suitable logical layout, data is preferably organised into basic data units (e.g. bytes) which in turn are grouped into larger logical data units (e.g. sectors). A physical failure, and in particular a grouped failure affecting many cells, can affect many bytes and possibly many sectors. It has been found that keeping information about cells, bytes or even sectors affected by physical failures is not efficient, due to the quantity of data involved. That is, attempts to produce a list of all logical data units rendered unusable due to at least one physical failure tend to generate a quantity of management data which is too large to handle efficiently. Further, depending on how the data is organised on the device, a single physical failure can potentially affect a large number of logical data units, such that avoiding use of all bytes, sectors or other units affected by a failure substantially reduces the storage capacity of the device. For example, a grouped failure such as a shorted bit failure in just one storage cell affects many other storage cells, which lie in the same row or the same column. Thus, in a 1024 by 1024 array, a single shorted bit failure can affect 1023 other cells lying in the same row, and 1023 other cells lying in the same column, giving a total of 2047 affected cells including the shorted cell itself. These 2047 affected cells may form part of many bytes, and many sectors, each of which would be rendered unusable by the single grouped failure.
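
The extent of a grouped failure can be checked with a short Python sketch. The function name is hypothetical, and the count assumes, as in the example above, that every cell sharing a row or column with the shorted cell is affected.

```python
def cells_affected_by_shorted_bit(rows: int, cols: int) -> int:
    """Cells sharing a row or a column with one shorted cell,
    counting the shorted cell itself exactly once."""
    if rows < 1 or cols < 1:
        raise ValueError("array dimensions must be at least 1")
    same_row_others = cols - 1     # other cells in the same row
    same_column_others = rows - 1  # other cells in the same column
    return same_row_others + same_column_others + 1


# A 1024 by 1024 array: 1023 + 1023 + 1 cells.
print(cells_affected_by_shorted_bit(1024, 1024))  # 2047
```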

[0040] Some improvements have been made in manufacturing processes and device construction to reduce the number of manufacturing failures and improve device longevity, but this usually involves increased manufacturing costs and complexity, and reduced device yields. Hence, techniques are being developed which respond to failures and avoid future loss of data. One example technique is the use of sparing. A row identified as containing failures is made redundant (spared) and replaced by one of a set of unused additional spare rows, and similarly for columns. However, either a physical replacement is required (i.e. routing connections from the failed row or column to instead reach the spare row or column), or else additional control overhead is required to map logical addresses to physical row and column lines. Only a limited sparing capacity can be provided, since enlarging the device to include spare rows and columns reduces device density for a fixed area of substrate and increases manufacturing complexity. Therefore, where failures are relatively common, sparing is unable to cope, leading to possible loss of data. Also, sparing is not useful in handling random failures, and involves additional management overhead to determine deployment of sparing capacity.

[0041] The preferred embodiments of the present invention employ error correction coding to provide a magnetoresistive solid-state storage device which is error tolerant, preferably to tolerate and recover from both random failures and systematic failures. Typically, error correction coding involves receiving original information which it is desired to store and forming encoded data which allows errors to be identified and ideally corrected. The encoded data is stored in the solid-state storage device. At read time, the original information is recovered by error correction decoding the encoded stored data. A wide range of error correction coding (ECC) schemes are available and can be employed alone or in combination. Suitable ECC schemes include both schemes with single-bit symbols (e.g. BCH) and schemes with multiple-bit symbols (e.g. Reed-Solomon).
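
The encode, store, read and decode cycle described above can be illustrated with a deliberately small code. The Python sketch below uses a Hamming(7,4) code with single-bit symbols as a stand-in; it is not the Reed-Solomon scheme of the preferred embodiments, and the function names are chosen only for this example.

```python
from typing import List, Tuple


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7,
    parity bits at positions 1, 2 and 4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword: List[int]) -> Tuple[List[int], int]:
    """Correct at most one flipped bit.

    Returns the recovered data bits and the 1-based position of the
    corrected bit (0 if no error was detected).
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # syndrome names the failed position
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], position


stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1                   # simulate one unreliable cell
print(hamming74_decode(stored))  # ([1, 0, 1, 1], 5)
```

A code of this size corrects only one failed bit per codeword; the multi-bit-symbol schemes preferred here tolerate many failed symbols per codeword.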

[0042] As general background information concerning error correction coding, reference is made to the following publication: W. W. Peterson and E. J. Weldon, Jr., “Error-Correcting Codes”, 2nd edition, 12th printing, 1994, MIT Press, Cambridge Mass.

[0043] A more specific reference concerning Reed-Solomon codes used in the preferred embodiments of the present invention is: “Reed-Solomon Codes and their Applications”, ed. S. B. Wicker and V. K. Bhargava, IEEE Press, New York, 1994.

[0044] FIG. 2 shows an example logical data structure used in preferred embodiments of the present invention. Original information 200 is received in predetermined units such as a sector comprising 512 bytes. Error correction coding is performed to produce a block of encoded data 202, in this case an encoded sector. The encoded sector 202 comprises a plurality of symbols 206 which can be a single bit (e.g. a BCH code with single-bit symbols) or can comprise multiple bits (e.g. a Reed-Solomon code using multi-bit symbols). In the preferred Reed-Solomon encoding scheme, each symbol 206 conveniently comprises eight bits. As shown in FIG. 2, the encoded sector 202 comprises four codewords 204, each comprising of the order of 144 to 160 symbols. The eight bits corresponding to each symbol are conveniently stored in eight storage cells 16. A physical failure which affects any of these eight storage cells can result in one or more of the bits being unreliable (i.e. the wrong value is read) or unreadable (i.e. no value can be obtained), giving a failed symbol.

[0045] Error correction decoding the encoded data 202 allows failed symbols 206 to be identified and corrected. The preferred Reed-Solomon scheme is an example of a linear error correcting code, which mathematically identifies and corrects completely up to a predetermined maximum number of failed symbols 206, depending upon the power of the code. For example, a [160,128,33] Reed-Solomon code having one hundred and sixty 8-bit symbols corresponding to one hundred and twenty-eight original information bytes and a minimum distance of thirty-three symbols can locate and correct up to sixteen failed symbols. Suitably, the ECC scheme employed is selected with a power sufficient to recover original information 200 from the encoded data 202 in substantially all cases. Very rarely, a block of encoded data 202 is encountered which is affected by so many failures that the original information 200 is unrecoverable. Also, very rarely the failures result in a mis-correct, where information recovered from the encoded data 202 is not equivalent to the original information 200. Even though the recovered information does not correspond to the original information, a mis-correct is not readily determined and means that the original information is unrecoverable.
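
The relationship between the code parameters quoted above can be reproduced with a small Python sketch. The class name `RSCode` and its methods are hypothetical; the sketch relies only on two standard properties of Reed-Solomon codes, namely a minimum distance of n - k + 1 and correction of up to floor((d - 1)/2) symbol errors at unknown positions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RSCode:
    n: int                    # symbols per codeword
    k: int                    # information symbols per codeword
    bits_per_symbol: int = 8  # 8-bit symbols, one storage cell per bit

    @property
    def min_distance(self) -> int:
        # Reed-Solomon codes are maximum distance separable: d = n - k + 1.
        return self.n - self.k + 1

    @property
    def max_correctable(self) -> int:
        # Symbol errors at unknown positions: t = floor((d - 1) / 2).
        return (self.min_distance - 1) // 2

    def cells_per_sector(self, sector_bytes: int = 512) -> int:
        """Storage cells needed for one encoded sector."""
        info_bits = self.k * self.bits_per_symbol
        codewords, remainder = divmod(sector_bytes * 8, info_bits)
        if remainder:
            raise ValueError("sector does not divide into whole codewords")
        return codewords * self.n * self.bits_per_symbol


code = RSCode(n=160, k=128)
print(code.min_distance, code.max_correctable)  # 33 16
print(code.cells_per_sector())                  # 5120
```

With 128 information bytes per codeword, a 512-byte sector yields the four codewords of FIG. 2; a shorter 144-symbol codeword carrying the same 128 information bytes would correct up to eight failed symbols.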

[0046] In the current MRAM devices, grouped failures tend to affect a large group of storage cells, lying in the same row or column. This provides an environment which is unlike prior storage devices. The preferred embodiments of the present invention employ an ECC scheme with multi-bit symbols. Where manufacturing processes and device design change over time, it may become more appropriate to organise storage locations expecting bit-based errors and then apply an ECC scheme using single-bit symbols, and at least some of the following embodiments can be applied to single-bit symbols.

[0047] FIG. 3 shows a simplified overview of a preferred method for controlling the MRAM device 1 of FIG. 1.

[0048] Step 301 comprises accessing a plurality of the storage cells 16 of the MRAM device. Preferably, the plurality of storage cells correspond to a block of encoded data, such as a codeword 204, or an encoded sector 202. Suitably, a plurality of read operations are performed by accessing the plurality of cells 16 using the row and column control lines 12 and 14. The read operations provide logical bit values which are used to form the symbols 206, and the symbols in turn are built into a complete logical block of data such as the codeword 204. In this example, four codewords 204 together form a complete encoded sector 202, from which the original information sector 200 can be recovered.

[0049] Step 302 comprises determining whether original information is unrecoverable from the block of encoded data. That is, the step 302 comprises determining whether decoding the block of encoded data is expected not to be able to produce recovered information, or determining whether attempting to decode the block of encoded data does not produce recovered information. The determining step can be performed by ECC decoding the block of encoded data as a logical evaluation technique, or can be performed using physical evaluation techniques, and preferably a combination of both logical and physical techniques are employed as will be described in more detail below.

[0050] Where step 302 determines that ECC decoding has not produced recovered information, or is not expected to produce recovered information, then remedial action is taken in step 304. Otherwise, use of the cells continues in step 303.

[0051] The remedial action in step 304 may take any suitable form, to manage future activity in the storage cells 16. As one example, the access of step 301 is immediately repeated, in the hope of avoiding some random errors and this time obtaining symbol values for the encoded data from which the original data can be recovered by ECC decoding. As a second example, the set of storage cells 16 corresponding to a failed codeword 204 or to a complete encoded sector 202 are identified and discarded, in order to avoid possible loss of data in future. In the currently preferred embodiments it is most convenient to use or discard sets of storage cells corresponding to a sector 202, although greater or lesser granularity can be applied as desired.
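
One possible way of combining the two example remedial actions is sketched below in Python: the access is repeated a limited number of times, and the set of cells is discarded only if decoding still fails. The callables `access`, `decode` and `discard` are placeholders supplied by the caller, and the retry limit is an assumption of this sketch rather than part of the described method.

```python
from typing import Callable, Optional, Sequence


def read_with_remedial_action(
    access: Callable[[], Sequence[int]],
    decode: Callable[[Sequence[int]], Optional[bytes]],
    discard: Callable[[], None],
    max_attempts: int = 2,
) -> Optional[bytes]:
    """Steps 301 to 304 in outline.

    access()  reads the set of cells and returns symbol values (step 301).
    decode()  returns recovered information, or None on failure (step 302).
    discard() removes the set of cells from future use (step 304).
    """
    for _ in range(max_attempts):
        symbols = access()
        recovered = decode(symbols)
        if recovered is not None:
            return recovered  # step 303: use of the cells continues
    discard()                 # repeated access did not help
    return None
```

A repeated access can only help against random failures; a systematic failure reproduces on every attempt, which is why the sketch falls back to discarding the set.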

[0052] FIG. 4 shows a more detailed preferred method for controlling the MRAM device, using logical evaluation of the accessed set of storage cells 16 corresponding to a block of encoded data such as a codeword 204 or an encoded sector 202.

[0053] Step 401 comprises accessing the set of storage cells 16, equivalent to step 301 above.

[0054] Step 402 comprises performing ECC decoding of the block of encoded data obtained by accessing the storage cells in step 401.

[0055] Step 403 comprises determining whether the ECC decoding of step 402 was not successful, in the sense that the ECC decoding has not produced recovered information from the block of data. Where ECC decoding is not successful, it is not possible to recover the original data 200 from the accessed storage cells 16, and remedial action can be taken as in step 304.

[0056] Optionally, the method includes the step 404 of determining the number of failed symbols identified by the ECC decoding of step 402, and comparing the identified number of failures against a threshold value. A physical failure in any of the accessed set of storage cells can result in a failed symbol. The threshold value selected for the comparison is preferably in the range of between about 50% and 95% of the maximum number of failures that can be corrected by performing the ECC decoding of step 402. The threshold value in step 404 is selected on the basis that although a number of failures have been identified in this particular block of data, it is still reasonable to continue using the selected set of storage cells with the expectation of still being able to successfully perform ECC decoding next time those cells are accessed. The threshold value in step 404 provides a safety margin allowing a further failure or failures to occur in the next access, whilst still allowing a successful ECC decoding to be performed.
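
The decisions of steps 403 and 404 might be sketched in Python as follows. The names `Decision` and `evaluate_decode_result`, and the default margin of 75%, are assumptions made for illustration; the text requires only that the threshold lie at about 50% to 95% of the correction capability.

```python
import math
from enum import Enum


class Decision(Enum):
    CONTINUE_USE = "continue use"
    REMEDIAL_ACTION = "remedial action"


def evaluate_decode_result(decode_succeeded: bool,
                           failed_symbols: int,
                           max_correctable: int,
                           margin: float = 0.75) -> Decision:
    """Apply step 403 (decode failure) and step 404 (safety margin)."""
    if not 0.0 < margin <= 1.0:
        raise ValueError("margin must be in (0, 1]")
    if not decode_succeeded:
        return Decision.REMEDIAL_ACTION  # step 403: nothing recovered
    threshold = math.floor(max_correctable * margin)
    if failed_symbols > threshold:
        return Decision.REMEDIAL_ACTION  # step 404: too little headroom
    return Decision.CONTINUE_USE


# With up to 16 correctable symbols and a 75% margin the threshold is 12.
print(evaluate_decode_result(True, 12, 16))  # Decision.CONTINUE_USE
print(evaluate_decode_result(True, 13, 16))  # Decision.REMEDIAL_ACTION
```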

[0057] In almost all practical cases, the ECC scheme employed issufficiently powerful to provide recovered information equivalent to theoriginal information sector 200. The original information 200 is outputfrom the MRAM device in step 405.

[0058] The method of FIG. 4 is conveniently employed whilst the MRAMdevice is in use. Suitably, the method of FIG. 4 is applied whilst thedevice stores variable user data, allowing dynamic management of datastorage in the device. For example, it is possible that the number ofsystematic errors will increase as the device ages. A small number ofsets of storage cells such as sectors 202 will become unreliable andshould be removed from active use as a remedial action. However, it isexpected that most sectors will continue in use reliably, by employing asuitable ECC scheme.

[0059] Additionally or alternatively, the method of FIG. 4 isconveniently applied when the MRAM device is first manufactured, or isfirst installed, or at power up, or at convenient times subsequentlysuch as a periodic check. Suitably, a sample of test data is applied toa block such as a sector, and the test method of FIG. 4 performed toestablish the suitability of that sector for future use.

[0060] FIG. 5 shows a second preferred method for controlling the MRAM device 1. As in FIGS. 3 and 4, the method is intended for use with a logical block of data such as codeword 204 or an encoded sector 202.

[0061] In step 501 the set of storage cells corresponding to the block of data is accessed, preferably in a set of read operations.

[0062] Step 502 comprises obtaining a plurality of parametric values associated with the accessed set of storage cells from the access of step 501. Suitably, a read voltage is applied along the row and column control lines 12, 14 causing a sense current to flow through selected storage cells 16, which have a resistance determined by parallel or anti-parallel alignment of the two magnetic films. The resistance of a particular cell is determined according to a phenomenon known as spin tunnelling and the cells are often referred to as magnetic tunnel junction storage cells. The condition of the storage cell is determined by measuring the sense current (which, for a given read voltage, varies inversely with resistance) or a related parameter such as response time to discharge a known capacitance.

[0063] Step 503 comprises comparing the obtained parametric values to one or more predicted ranges. The comparison of step 503 in almost all cases allows a logical value (e.g. one or zero) to be established for each cell. However, the comparison also conveniently allows at least some forms of physical failure to be identified. For example, it has been determined that a shorted-bit failure leads to a very low resistance value in all cells of a particular row and a particular column. Also, open-bit failures can cause a very high resistance value for all cells of a particular row and column. By comparing the obtained parametric values against predicted ranges, cells affected by failures such as shorted-bit and open-bit failures can be identified with a high degree of certainty.

[0064] FIG. 6 is a graph showing, as an illustrative example, the probability (p) that a particular cell will have a certain parametric value, in this case resistance (r), corresponding to a logical “0” in the left-hand curve, or a logical “1” in the right-hand curve. As an arbitrary scale, probability has been given between 0 and 1, whilst resistance is plotted between 0 and 100%. The resistance scale has been divided into five ranges. In range 601, the resistance value is very low and the predicted range represents a shorted-bit failure with a reasonable degree of certainty. Range 602 represents a low resistance value within expected boundaries, which in this example is determined as equivalent to a logical “0”. Range 603 represents a medium resistance value where a logical value cannot be ascertained with any degree of certainty. Range 604 is a high resistance range representing a logical “1”. Range 605 is a very high resistance value where an open-bit failure can be predicted with a high degree of certainty. The ranges shown in FIG. 6 are purely for illustration, and many other possibilities are available depending upon the physical construction of the MRAM device 1, the manner in which the storage cells are accessed, and the parametric values obtained. The range or ranges are suitably calibrated depending, for example, on environmental factors such as temperature, factors affecting a particular cell or cells and their position within the array, or the nature of the cells themselves and the type of access employed.
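The five ranges of FIG. 6 can be pictured as a simple lookup against four boundaries on the 0 to 100% resistance scale. The following sketch is illustrative only: the boundary values (10, 40, 60 and 90%) and all identifiers are assumptions, since the description leaves the ranges to be calibrated for the particular device.

```python
from bisect import bisect_right
from enum import Enum


class CellState(Enum):
    """Outcome of comparing a cell resistance with the ranges of FIG. 6."""
    SHORTED_BIT = 601    # very low resistance: shorted-bit failure
    LOGICAL_0 = 602      # low resistance within expected boundaries
    INDETERMINATE = 603  # medium resistance: logical value uncertain
    LOGICAL_1 = 604      # high resistance within expected boundaries
    OPEN_BIT = 605       # very high resistance: open-bit failure


# Hypothetical boundaries between adjacent ranges, in percent of the
# arbitrary resistance scale. Real values would come from calibration.
BOUNDARIES = (10.0, 40.0, 60.0, 90.0)
STATES = (CellState.SHORTED_BIT, CellState.LOGICAL_0,
          CellState.INDETERMINATE, CellState.LOGICAL_1,
          CellState.OPEN_BIT)


def classify_resistance(r_percent: float) -> CellState:
    """Map a resistance reading (0 to 100%) onto one of the five ranges."""
    if not 0.0 <= r_percent <= 100.0:
        raise ValueError("resistance must lie on the 0 to 100% scale")
    # bisect_right counts how many boundaries are at or below the reading,
    # which is the index of the range containing it.
    return STATES[bisect_right(BOUNDARIES, r_percent)]


for reading in (5.0, 25.0, 50.0, 75.0, 95.0):
    print(reading, classify_resistance(reading).name)
```

Keeping the boundaries in a single tuple makes recalibration, for example for temperature or for position within the array, a matter of substituting a different tuple.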

[0065] Referring again to FIG. 5, step 504 comprises counting a number of physical failures, as identified in the comparison of step 503. Suitably, the count of parametric failures in step 504 is performed on the basis of the number of symbols 206 (each containing one or more bits) which are affected by the identified physical failures.
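A minimal sketch of the symbol-based count of step 504 is given below. It assumes, purely for illustration, that the cells of a block are numbered consecutively and that each symbol occupies eight adjacent cells; the actual mapping of cells to symbols depends on the device layout and is not specified here.

```python
from typing import Iterable


def count_failed_symbols(failed_cells: Iterable[int],
                         bits_per_symbol: int = 8) -> int:
    """Count the distinct symbols touched by at least one failed cell.

    Assumes cell index i belongs to symbol i // bits_per_symbol, which is
    a hypothetical layout chosen only for this sketch.
    """
    if bits_per_symbol < 1:
        raise ValueError("bits_per_symbol must be at least 1")
    # A set is used so that several failed cells within the same symbol
    # are counted as a single failed symbol.
    return len({cell // bits_per_symbol for cell in failed_cells})


# Cells 3 and 5 fall in symbol 0, cell 17 in symbol 2, cell 40 in symbol 5.
print(count_failed_symbols([3, 5, 17, 40]))  # 3
```

Counting symbols rather than cells reflects the fact that a symbol-oriented ECC scheme corrects a whole symbol regardless of how many of its bits are affected.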

[0066] Step 505 comprises comparing the number of parametric failures, i.e. the number of failed symbols identified by parametric testing, against a predetermined threshold value. The number of physical failures can be represented in any suitable form. Depending upon the nature of the ECC scheme employed, some types of failure can be weighted differently to other types of failure. Since the data stored in the storage cells represents encoded data, it is expected that ECC decoding will not be able to recover the original data where the number of parametric failures is greater than the maximum power of the ECC scheme, i.e. the maximum number of failures which the scheme is able to correct. Hence, the threshold value is suitably selected to represent a value which is equal to or less than the maximum number of failures which the ECC scheme employed is able to correct. Preferably, the threshold value in step 505 is selected to be substantially less than the maximum power of the ECC decoding scheme, suitably of the order of 50% to 95% of the maximum power. In a particular preferred embodiment the threshold value in step 505 is selected to represent about 50% to 75% and suitably about 60% of the maximum power of the employed ECC scheme. Preferably, step 505 comprises determining whether the number of parametric failures is greater than the threshold value, in which case performing ECC decoding is expected (with a sufficiently high probability) not to be able to recover information from the encoded data. That is, where the number of parametric failures is greater than the threshold value, there is a greater than acceptable probability that information is unrecoverable from the encoded data.
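The comparison of step 505, including the optional weighting of different failure types, might be sketched as follows. The failure-type names, the weights and the example maximum power of 10 symbols are all assumptions made for the sketch; only the default fraction of 60% is taken from the preferred embodiment described above.

```python
from typing import Mapping


def expected_unrecoverable(failures: Mapping[str, int],
                           weights: Mapping[str, float],
                           max_power: int,
                           fraction: float = 0.6) -> bool:
    """Return True when the weighted parametric failure count is greater
    than the threshold, so that ECC decoding is not expected to succeed
    with acceptable probability.

    `failures` maps a failure type to the number of failed symbols of
    that type; `weights` maps the same types to hypothetical weights.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in the interval (0, 1]")
    score = sum(weights[kind] * count for kind, count in failures.items())
    threshold = fraction * max_power
    return score > threshold


# Assumed example: equal weights and a scheme that corrects 10 symbols,
# giving a threshold of 6 failed symbols at a fraction of 60%.
unit_weights = {"shorted_bit": 1.0, "open_bit": 1.0}
print(expected_unrecoverable({"shorted_bit": 4, "open_bit": 3},
                             unit_weights, max_power=10))
print(expected_unrecoverable({"shorted_bit": 2, "open_bit": 1},
                             unit_weights, max_power=10))
```

With equal weights the function reduces to a plain count against the threshold; unequal weights allow a failure type that is cheaper or more expensive for a given ECC scheme to contribute accordingly.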

[0067] Step 506 comprises determining whether or not to continue use of the set of cells corresponding to the accessed block of data, in view of the number of parametric failures which have been identified. If desired, remedial action can be taken as outlined in step 304.

[0068] The physical evaluation of FIG. 5 is particularly useful as a test procedure immediately following manufacture of the device, or at installation, or at power up, or at any convenient time subsequently. In one example, the test procedure of FIG. 5 is performed by writing a test set of data to the device and then reading from the device, or by any other suitable parametric testing. In particular, it is useful to apply the method of FIG. 5 to identify areas of the MRAM device which are severely affected by systematic errors caused by manufacturing imperfections, and remedial action can then be taken before the device is put into active use storing variable user data. In the preferred embodiment, each sector comprises four codewords, and a sector is made redundant where any one of its four codewords contains a number of parametric failures which is greater than the threshold value of step 505. A block of data such as an encoded sector 202 having a number of failed symbols greater than the threshold value is not used at all in the subsequent life span of the device, because the probability of unrecoverable data errors would be too high. The threshold value used in the test procedure is set such that at least one and preferably several failures occurring subsequently will be tolerated. In particular, the threshold value is set to allow further systematic failures to be tolerated together with at least one and preferably several random failures, in a block of data.
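The sector-level rule of the preferred embodiment, in which a sector is made redundant where any one of its codewords exceeds the threshold, can be illustrated as below. The sector identifiers, failure counts and threshold of 6 are invented for the example.

```python
from typing import Dict, List, Sequence


def sector_is_usable(codeword_failures: Sequence[int],
                     threshold: int) -> bool:
    """A sector stays in use only if every codeword is at or below the
    threshold; one codeword above it makes the whole sector redundant."""
    return all(count <= threshold for count in codeword_failures)


def usable_sectors(failures_by_sector: Dict[int, Sequence[int]],
                   threshold: int) -> List[int]:
    """Return the identifiers of sectors that pass the test procedure."""
    return [sector for sector, counts in failures_by_sector.items()
            if sector_is_usable(counts, threshold)]


# Hypothetical test results: parametric failure counts for the four
# codewords of each sector, checked against an assumed threshold of 6.
results = {0: [0, 1, 0, 2], 1: [3, 7, 0, 0], 2: [6, 6, 6, 6]}
print(usable_sectors(results, threshold=6))  # [0, 2]
```

Sector 1 is rejected because a single codeword carries 7 failures, even though its other codewords are nearly free of failures.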

[0069] The parametric evaluation of FIG. 5 is particularly useful in determining shorted-bit and/or open-bit failures in MRAM devices. A systematic failure, such as a half select or some forms of isolated bit failure, is not so easily detectable using parametric tests, but is more readily discovered by logical evaluation using ECC decoding as in FIG. 4. Therefore, in particularly preferred embodiments of the present invention the logical evaluation of FIG. 4 is combined with the parametric evaluation of FIG. 5 to provide a practical device which is able to take advantage of the considerable benefits offered by the new MRAM technology whilst minimising the limitations of currently available manufacturing techniques.
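One possible way of combining the two evaluations is to apply the parametric test first and to attempt ECC decoding only when the block is not expected to be unrecoverable. The sketch below assumes this ordering; the function name, the returned labels and the `decode` callable (which stands in for whatever ECC decoder the device employs, returning None on failure) are hypothetical.

```python
from typing import Callable, Optional, Tuple


def evaluate_block(parametric_failures: int,
                   threshold: int,
                   decode: Callable[[], Optional[bytes]]
                   ) -> Tuple[str, Optional[bytes]]:
    """Combine the parametric evaluation with the logical evaluation.

    Returns ("remedial", None) if either evaluation indicates that the
    set of cells should not continue in use, otherwise ("continue", data).
    """
    # Parametric evaluation: too many identified failures means decoding
    # is not attempted at all.
    if parametric_failures > threshold:
        return ("remedial", None)
    # Logical evaluation: decoding may still fail because of failures
    # that parametric testing does not reveal.
    recovered = decode()
    if recovered is None:
        return ("remedial", None)
    return ("continue", recovered)


# Stand-in decoders for illustration only.
print(evaluate_block(2, 6, lambda: b"sector data"))
print(evaluate_block(9, 6, lambda: b"sector data"))
print(evaluate_block(2, 6, lambda: None))
```

Skipping the decode attempt for blocks that fail the parametric test avoids spending decoding effort on data that is unlikely to be recovered.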

[0070] The MRAM device described herein is ideally suited for use in place of any prior solid-state storage device. In particular, the MRAM device is ideally suited for use both as a short-term storage device (e.g. cache memory) and as a longer-term storage device (e.g. a solid-state hard disk). An MRAM device can be employed for both short term storage and longer term storage within a single apparatus, such as a computing platform.

[0071] A magnetoresistive solid-state storage device and methods for controlling such a device have been described. Advantageously, the storage device is able to tolerate a relatively large number of errors, including both systematic failures and transient failures, whilst successfully remaining in operation with no loss of original data. Simpler and lower cost manufacturing techniques are employed and/or device yield and device density are increased. As manufacturing processes improve, overhead of the employed ECC scheme can be reduced. However, error correction coding and decoding allows blocks of data, e.g. sectors or codewords, to remain in use, where otherwise the whole block must be discarded if only one failure occurs. Therefore, the preferred embodiments of the present invention avoid large scale discarding of logical blocks and reduce or even eliminate completely the need for inefficient control methods such as large-scale data mapping management or physical sparing.

1. A method for controlling a magnetoresistive solid-state storage device having a plurality of storage cells for storing a block of ECC encoded data, the method comprising the steps of: accessing a set of the plurality of storage cells; and determining whether information is unrecoverable from a block of ECC encoded data stored in the accessed storage cells.
2. The method of claim 1, comprising determining whether information is unrecoverable, by attempting to perform ECC decoding of the block of ECC encoded data.
3. The method of claim 2, comprising continuing use of the set of storage cells, if the ECC decoding recovers information from the block of ECC encoded data.
4. The method of claim 2, comprising taking remedial action concerning the set of storage cells, if the ECC decoding does not recover information from the block of ECC encoded data.
5. The method of claim 2, comprising identifying, from the ECC decoding, zero or more failed symbols in the block of ECC encoded data; and comparing the identified number of failed symbols against a threshold value.
6. The method of claim 1, comprising determining whether original information is expected to be unrecoverable from a block of ECC encoded data stored in the accessed set of storage cells.
7. The method of claim 6, wherein original information is expected to be unrecoverable because a probability of failing to correctly perform ECC decoding of the block of ECC encoded data is unacceptably high.
8. The method of claim 6, comprising continuing use of the set of storage cells, when original information is not expected to be unrecoverable from the block of ECC encoded data stored in the accessed storage cells.
9. The method of claim 8, comprising taking remedial action concerning the set of storage cells, when original information is expected to be unrecoverable from a block of ECC encoded data stored in the accessed storage cells.
10. The method of claim 6, comprising determining, from accessing the set of storage cells, failed symbols in the block of ECC encoded data that have been affected by a physical failure.
11. The method of claim 10, comprising determining that there are more failed symbols in the block of ECC encoded data than can be corrected by error correction decoding the block of ECC encoded data.
12. The method of claim 10, comprising determining that due to failed symbols in the block of ECC encoded data, there is an unacceptable probability that decoding the block of ECC encoded data will not correctly recover original information.
13. The method of claim 6, comprising obtaining a parametric value for each of the set of storage cells, and comparing each parametric value against a range or ranges.
14. The method of claim 13, comprising deriving a logical bit value for each storage cell, as a result of comparing each parametric value against a range or ranges.
15. The method of claim 13, comprising identifying a cell or cells, amongst the set of storage cells, as being affected by a physical failure.
16. The method of claim 15, wherein the determining step comprises comparing a failure count based on the identified cells against a threshold value.
17. The method of claim 16, wherein the threshold value represents a number of failed symbols equal to or less than a total number of failed symbols which can be corrected by error correction decoding the block of ECC encoded data.
18. The method of claim 15, comprising using the identified cells to determine failed symbols, and comparing a count of the failed symbols against the threshold value.
19. The method of claim 18, wherein the threshold value is set to be in the range of about 50% to about 95% of the maximum number of failed symbols which can be corrected by error correction decoding the block of ECC encoded data.
20. The method of claim 6, comprising selectively ECC decoding the block of ECC encoded data in response to the determining step.
21. The method of claim 1, wherein the block of encoded data corresponds to a sector of original information.
22. The method of claim 1, wherein the block of ECC encoded data is a codeword, and wherein a plurality of codewords are grouped to form an encoded sector corresponding to a sector of original information.
23. The method of claim 1, performed prior to use of the storage device.
24. The method of claim 1, performed during use of the storage device.
25. A method for controlling a magnetoresistive solid-state storage device, comprising the steps of: receiving original information which it is desired to store; error correction encoding the original information to form a block of ECC encoded data; storing the block of ECC encoded data in a set of magnetoresistive storage cells arranged in at least one array; accessing the set of storage cells; forming logical symbol values of the block of ECC encoded data from the accessed set of storage cells; error correction decoding the block of ECC encoded data to provide recovered information; if the decoding step provides recovered information then outputting the recovered information and continuing use of the set of storage cells, or else if the decoding step did not provide recovered information then taking remedial action in respect of the set of storage cells.
26. The method of claim 25, comprising: identifying, from the ECC decoding, zero or more failed symbols in the block of ECC encoded data; comparing the identified number of failed symbols against a threshold value; and if the ECC decoding did not recover original information, or if the identified number of failed symbols is greater than the threshold value, then taking remedial action concerning the accessed set of storage cells.
27. A method for controlling a magnetoresistive solid-state storage device, comprising the steps of: receiving original information which it is desired to store; error correction encoding the original information to form a block of ECC encoded data; storing the block of ECC encoded data in a set of magnetoresistive storage cells arranged in at least one array; accessing the set of storage cells; comparing parametric values obtained by accessing the set of storage cells against one or more ranges; identifying failed cells amongst the accessed set of cells; forming a failure count based on the identified failed cells; comparing the failure count against a threshold value; and determining whether the original information is expected to be unrecoverable from the block of ECC encoded data stored in the accessed set of storage cells.
28. The method of claim 27, comprising selectively attempting error correction decoding of the block of ECC encoded data, when original information is not expected to be unrecoverable, or else taking remedial action for the accessed set of storage cells where original information is expected to be unrecoverable.
29. The method of claim 28, wherein comparing the failure count against the threshold value indicates a probability of failing to correctly perform ECC decoding on the block of ECC encoded data as acceptable or unacceptable.
30. The method of claim 27, wherein the failure count is based on a number of failed symbols in the block of ECC encoded data, the failed symbols being identified with reference to the failed cells.
31. The method of claim 27, wherein the threshold value represents about 50% to about 95% of the maximum number of failed symbols which can be corrected by error correction decoding the block of ECC encoded data.
32. A magnetoresistive solid-state storage device, comprising: at least one array of magnetoresistive storage cells; an ECC encoding unit for forming a block of ECC encoded data from a unit of original information; and a controller arranged to store the block of ECC encoded data in a set of the storage cells, access the set of storage cells, and determine whether the original information is unrecoverable from the block of ECC encoded data stored in the accessed set of storage cells.
33. An apparatus comprising the magnetoresistive solid-state storage device of claim 32.